Video structuring by probabilistic merging of video segments

ABSTRACT

A method for structuring video by probabilistic merging of video segments includes the steps of obtaining a plurality of frames of unstructured video; generating video segments from the unstructured video by detecting shot boundaries based on color dissimilarity between consecutive frames; extracting a feature set by processing pairs of segments for visual dissimilarity and their temporal relationship, thereby generating an inter-segment visual dissimilarity feature and an inter-segment temporal relationship feature; and merging video segments with a merging criterion that applies a probabilistic analysis to the feature set, thereby generating a merging sequence representing the video structure. The probabilistic analysis follows a Bayesian formulation and the merging sequence is represented in a hierarchical tree structure.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of prior U.S. patent application Ser. No. 09/927,041, filed Aug. 9, 2001, the entire disclosure of which is hereby incorporated herein by reference.

FIELD OF THE INVENTION

The invention relates generally to the processing and browsing of video material, and in particular to accessing, organizing and manipulating information from home videos.

BACKGROUND

Among all the sources of video content, unstructured consumer video probably constitutes the content that most people are or would eventually be interested in dealing with. Organizing and editing personal memories by accessing and manipulating home videos represents a natural technological extension to the traditional still picture organization. However, although attractive with the advent of digital video, such efforts remain limited by the size of these visual archives, and by the lack of efficient tools for accessing, organizing, and manipulating home video information. The creation of such tools would also open doors to the organization of video events in albums, video baby books, editions of postcards with stills extracted from video data, multimedia family web-pages, etc. In fact, the variety of user interests suggests an interactive solution, which requires a minimum amount of user feedback to specify the desired tasks at the semantic level, and which provides automated algorithms for those tasks that are tedious or can be performed reliably.

In commercial video, many moving image documents have story structures which are reflected in the visual content. In such situations, a complete moving image document is referred to as a video clip. The fundamental unit of the production of video is the shot, which captures continuous action. The identification of video shots is achieved by scene change detection schemes which give the start and end of each shot. A scene is usually composed of a small number of interrelated shots that are unified by location or dramatic incident. Feature films are typically composed of a number of scenes, which define a storyline for understanding the content of the moving image document.

In contrast with commercial video, unrestricted content and the absence of storyline are the main characteristics of home video. Consumer contents are usually composed of a set of events, either isolated or related, each composed of one or a few shots, randomly spread along time. Such characteristics make consumer video unsuitable for video analysis approaches based on storyline models. However, there still exists a spatio-temporal structure, based on visual similarity and temporal adjacency between video segments (sets of shots) that appears evident after a statistical analysis of a large home video database. Such structure, essentially equivalent to the structure of consumer still images, points towards addressing home video structuring as a problem of clustering. The task at hand could be defined as the determination of the number of clusters present in a given video clip, and the design of an optimality criterion for assigning cluster labels to each frame/shot in the video sequence. This has indeed been the direction taken by most research in video analysis, even when dealing with storylined content.

For example, in U.S. Pat. No. 5,821,945, a technique is described for extracting a hierarchical decomposition of a complex video selection for browsing purposes, and combining visual and temporal information to capture the important relations within a scene and between scenes in a video. Thus, it is said, this allows the analysis of the underlying story structure with no a priori knowledge of the content. Such approaches perform video structuring in variations of a two-stage methodology: video shot boundary detection (shot segmentation), and shot clustering. The first stage is by far the most studied in video analysis (see, e.g., U. Gargi, R. Kasturi and S. H. Strayer, “Performance Characterization of Video-Shot-Change Detection Methods”, IEEE CSVT, Vol. 10, No. 1, February 2000, pp. 1-13). For the second stage, using shots as the fundamental unit of video structure, K-means, distribution-based clustering, and time-constrained merging techniques have all been disclosed in the prior art. Some of these methods usually require setting of a number of parameters, which are either application-dependent or empirically determined by user feedback.

As understood in the prior art, hierarchical representations seem to be not only natural to represent unstructured content, but are probably the best way of providing useful non-linear interaction models for browsing and manipulation. Fortunately, as a byproduct, clustering allows for the generation of hierarchical representations for video content. Different models for hierarchical organization have also been proposed in the prior art, including scene transition graphs (e.g., see the aforementioned U.S. Pat. No. 5,821,945), and tables of contents based on trees, although the efficiency/usability of each specific model remains in general as an open issue.

To date, only a few works have dealt with analysis of home video (e.g., see G. Iyengar and A. Lippman, “Content-based Browsing and Edition of Unstructured Video”, IEEE ICME, New York City, August 2000; R. Lienhart, “Abstracting Home Video Automatically”, ACM Multimedia Conference, Orlando, October, 1999, pp. 37-41; and Y. Rui and T. S. Huang, “A Unified Framework for Video Browsing and Retrieval”, in A. C. Bovik, Ed., Handbook of Image and Video Processing, Academic Press, 1999). The work in the Lienhart article uses time-stamp information to perform clustering for generation of video summaries. Time-stamp information, however, might not always be available. Even though digital cameras include this information, users do not always use the time option. Therefore, a general solution cannot rely on this information. The work in the Rui and Huang article for generation of tables-of-contents, based on very simple statistical assumptions, was tested on some home videos with “storyline”. However, the highly unstructured nature of home video makes the application of specific storyline models quite limited. With the exception of the Iyengar and Lippman article, none of the previous approaches have analyzed in detail the inherent statistics of such content. From this point of view, the present invention is more related to the work in N. Vasconcelos and A. Lippman, “A Bayesian Video Modeling Framework for Shot Segmentation and Content Characterization”, Proc. CVPR, 1997, that proposes a Bayesian formulation for shot boundary detection based on statistical models of shot duration, and to the work in the Iyengar and Lippman article that addresses home video analysis using a different probabilistic formulation.

Nonetheless, it is unclear from the prior art that a probabilistic methodology that uses video shots as the unit of organization could support the creation of a video hierarchy for interaction. In arriving at the present invention, statistical models of visual and temporal features in consumer video have been investigated for organization purposes. In particular, a Bayesian formulation seemed appealing to encode prior knowledge of the spatio-temporal structure of home video. In a departure from the prior art, the inventive approach described herein is based on an efficient probabilistic video segment merging algorithm which integrates inter-segment features of visual similarity, temporal adjacency, and duration in a joint model that allows for the generation of video clusters without empirical parameter determination.

SUMMARY

The present invention is directed to overcoming one or more of the problems set forth above. Briefly summarized, according to one aspect of the present invention, a method for structuring video by probabilistic merging of video segments includes the steps of a) obtaining a plurality of frames of unstructured video; b) generating video segments from the unstructured video by detecting shot boundaries based on color dissimilarity between consecutive frames; c) extracting a feature set by processing pairs of segments for visual dissimilarity and their temporal relationship, thereby generating an inter-segment visual dissimilarity feature and an inter-segment temporal relationship feature; and d) merging video segments with a merging criterion that applies a probabilistic analysis to the feature set, thereby generating a merging sequence representing the video structure. In the preferred embodiment, the probabilistic analysis follows a Bayesian formulation and the merging sequence is represented in a hierarchical tree structure that includes a frame extracted from each segment.

As described above, this invention employs methods for consumer video structuring based on probabilistic models. More specifically, the invention proposes a novel methodology to discover cluster structure in home videos, using video shots as the unit of organization. The methodology is based on two concepts: (i) the development of statistical models (e.g., learned joint mixture Gaussian models) to represent the distribution of inter-segment visual similarity and an inter-segment temporal relationship, including temporal adjacency and duration of home video segments, and (ii) the reformulation of hierarchical clustering (merging) as a sequential binary classification process. The models are used in (ii) in a probabilistic clustering algorithm, for which a Bayesian formulation is useful since these models can incorporate prior knowledge of the statistical structure of home video, and which offers the advantages of a principled methodology. Such prior knowledge can be extracted from the detailed analysis of the cluster structure of a real home video database.

The video structuring algorithm can be efficiently implemented according to the invention and does not need any ad-hoc parameter determination. As a byproduct, finding video clusters allows for the generation of hierarchical representations for video content, which provide nonlinear access for browsing and manipulation.

A principal advantage of the invention is that, based on the performance of the methodology with respect to cluster detection and individual shot-cluster labeling, it is able to deal with unstructured video and video with unrestricted content, as would be found in consumer home video. Thus, it is the first step for building tools for a system for the interactive organization and retrieval of home video information.

As a methodology for consumer video structuring based on a Bayesian video segment merging algorithm, another advantage is that the method automatically governs the merging process, without empirical parameter determination, and integrates visual and temporal segment dissimilarity features in a single model.

Furthermore, the representation of the merging sequence by a tree provides the basis for a user-interface that allows for hierarchical, non-linear access to the video content.

These and other aspects, objects, features and advantages of the present invention will be more clearly understood and appreciated from a review of the following detailed description of the preferred embodiments and appended claims, and by reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram providing a functional overview of video structuring according to the present invention;

FIG. 2 is a flow graph of the video segment merging stage shown in FIG. 1;

FIG. 3 is a distribution plot of consumer video shot duration for a group of consumer images;

FIG. 4 is a scatter plot of labeled inter-segment feature vectors extracted from a home video; and

FIG. 5 is a tree representation of key frames from a typical home video.

DETAILED DESCRIPTION OF THE INVENTION

Because video processing systems employing shot detection and cluster analysis are well known, the present description will be directed in particular to attributes forming part of, or cooperating more directly with, a video structuring technique in accordance with the present invention. Attributes not specifically shown or described herein may be selected from those known in the art. In the following description, a preferred embodiment of the present invention would ordinarily be implemented as a software program, although those skilled in the art will readily recognize that the equivalent of such software may also be constructed in hardware. Given the system as described according to the invention in the following materials, software not specifically shown, suggested or described herein that is useful for implementation of the invention is conventional and within the ordinary skill in such arts. If the invention is implemented as a computer program, the program may be stored in conventional computer readable storage medium, which may comprise, for example; magnetic storage media such as a magnetic disk (such as a floppy disk or a hard drive) or magnetic tape; optical storage media such as an optical disc, optical tape, or machine readable bar code; solid state electronic storage devices such as random access memory (RAM), or read only memory (ROM); or any other physical device or medium employed to store a computer program.

Accessing, organizing and manipulating personal memories stored in home videos constitutes a technical challenge, due to its unrestricted content, and the lack of clear storyline structure. In this invention, a methodology is provided for structuring of consumer video, based on the development of parametric statistical models of similarity and adjacency between shots, the unit of visual information in consumer video clips. A Bayesian formulation for merging of shots appears as a reasonable choice as these models can encode prior knowledge of the statistical structure of home video. Therefore, the methodology is based on shot boundary detection and Bayesian segment merging. Gaussian Mixture joint models of inter-segment visual similarity, temporal adjacency and segment duration, learned from home video training samples using the Expectation-Maximization (EM) algorithm, are used to represent the class-conditional densities of the observed features. Such models are then used in a merging algorithm consisting of a binary Bayes classifier, where the merging order is determined by a variation of Highest Confidence First (HCF), and the Maximum a Posteriori (MAP) criterion defines the merging criterion. The merging algorithm can be efficiently implemented by the use of a hierarchical queue, and does not need any empirical parameter determination. Finally, the representation of the merging sequence by a tree provides the basis for a user-interface that allows for hierarchical, non-linear access to the video content.

Referring first to FIG. 1, the video structuring method is shown to operate on a sequence of video frames (stage 8) obtained from an unstructured video source, typically displaying an unrestricted content, such as found in consumer home videos. The salient features of the video structuring method according to the invention can be concisely summarized in the following four stages (which will be subsequently described in later sections in more detail):

1) The Video Segmentation Stage 10: Shot detection is computed by adaptive thresholding of a histogram difference signal. 1-D color histograms are computed in RGB space, with N=64 quantization levels for each band. The L1 metric is used to represent the dissimilarity d_(C)(t,t+1) between two consecutive frames. As a post-processing step, an in-place morphological hit-or-miss transform is applied to the binary signal with a pair of structuring elements that eliminate the presence of multiple adjacent shot boundaries.

2) The Video Shot Feature Extraction Stage 12: It is known in the art that visual similarity is not enough to differentiate between two different video events (e.g., see the Rui and Huang article). Both visual similarity and temporal information have been used for shot clustering in the prior art. (However, the statistical properties of such variables have not been studied under a Bayesian perspective.) In this invention, three main features in a video sequence are utilized as criteria for subsequent merging:

-   Visual similarity is described by the mean segment histogram that represents segment appearance. The mean histogram represents both the presence of the dominant colors and their persistence within the segment.
-   Temporal separation between segments is a strong indication of their belonging to the same cluster.
-   Combined temporal duration of two individual segments is also a strong indicator about their belonging to the same cluster (e.g., two long shots are not likely to belong to the same video cluster).

3) The Video Segment Merging Stage 14: This step is carried out by formulating a two-class (merge/not merge) pattern classifier based on Bayesian decision theory. Gaussian Mixture joint models of inter-segment visual similarity, temporal adjacency and segment duration, learned from home video training samples using the Expectation-Maximization (EM) algorithm, are used to represent the class-conditional densities of the observed features. Such models are then used in a merging algorithm comprising a binary Bayes classifier, where the merging order is determined by a variation of Highest Confidence First (HCF), and the Maximum a Posteriori (MAP) criterion defines the merging criterion. The merging algorithm can be efficiently implemented by the use of a hierarchical queue, and does not need any empirical parameter determination. A flow graph of the merging procedure is given in FIG. 2 and will be described in further detail later in this description.

4) The Video Segment Tree Construction Stage 16: The merging sequence, i.e. a list with the successive merging of pairs of video segments, is stored and used to generate a hierarchy, whose merging sequence is represented by a binary partition tree 18. FIG. 5 shows a tree representation from a typical home video.

1. An Overview of the Approach

Assume a feature vector representation for video segments, i.e., suppose that a video clip has been divided into shots or segments (where a segment is composed of one or more shots), and that features that represent them have been extracted. Any clustering procedure should specify mechanisms both to assign cluster labels to each segment in the home video clip and to determine the number of clusters (where a cluster may encompass one or more segments). The clustering process needs to include time as a constraint, as video events are of limited duration (e.g., see the Rui and Huang article). However, the definition of a generic generative model for intra-segment features in home videos is particularly difficult, given their unconstrained content. Instead, according to the present invention, home video is analyzed using statistical inter-segment models. In other words, the invention proposes to build up models that describe the properties of visual and temporal features defined on pairs of segments. Inter-segment features naturally emerge in a merging framework, and integrate visual dissimilarity, duration, and temporal adjacency. A merging algorithm can be thought of as a classifier, which sequentially takes a pair of video segments and decides whether they should be merged or not. Let s_(i) and s_(j) denote the i-th and j-th video segments in a video clip, and let ε be a binary random variable (r.v.) that indicates whether such pair of segments correspond to the same cluster and should be merged or not. The formulation of the merging process as a sequential two-class (merge/not merge) pattern classification problem allows for the application of concepts from Bayesian decision theory (for a discussion of Bayesian decision theory, see, e.g., R. O. Duda, P. E. Hart and D. G. Stork, Pattern Classification, 2^(nd) ed., John Wiley and Sons, 2000). The Maximum a Posteriori (MAP) criterion establishes that given an n-dimensional realization x_(ij) of an r.v. x (representing inter-segment features and detailed later in the specification), the class that must be selected is the one that maximizes the a posteriori probability mass function of ε given x, i.e.,

$\varepsilon^{*} = \underset{\varepsilon}{\arg\max}\; \Pr(\varepsilon \mid x)$

By Bayes rule,

$\Pr(\varepsilon \mid x) = \frac{p(x \mid \varepsilon)\,\Pr(\varepsilon)}{p(x)}$

where p(x|ε) is the likelihood of x given ε, Pr(ε) is the prior of ε, and p(x) is the distribution of the features. The application of the MAP principle can then be expressed as

$\varepsilon^{*} = \begin{cases} 1, & p(x \mid \varepsilon = 1)\,\Pr(\varepsilon = 1) > p(x \mid \varepsilon = 0)\,\Pr(\varepsilon = 0) \\ 0, & \text{otherwise} \end{cases}$

or, in standard hypothesis testing notation, as

$p(x \mid \varepsilon = 1)\,\Pr(\varepsilon = 1) \;\underset{H_{0}}{\overset{H_{1}}{\gtrless}}\; p(x \mid \varepsilon = 0)\,\Pr(\varepsilon = 0)$

where H₁ denotes the hypothesis that the pair of segments should be merged, and H₀ denotes the opposite. With this formulation, the classification of pairs of shots is performed sequentially, until a certain stop criterion is satisfied. Therefore, the tasks are the determination of a useful feature space, the selection of models for the distributions, and the specification of the merging algorithm. Each of these steps is described in the following sections of the description.

2. Video Segmentation

To generate the basic segments, shot boundary detection is computed in stage 10 by a series of methods to detect the cuts usually found in home video (see, e.g., U. Gargi, R. Kasturi and S. H. Strayer, “Performance Characterization of Video-Shot-Change Detection Methods”, IEEE CSVT, Vol. 10, No. 1, February 2000, pp. 1-13). Over-segmentation due to detection errors (e.g. due to illumination or noise artifacts) can be handled by the clustering algorithm. Additionally, videos of very poor quality are removed.

In implementing a preferred embodiment of the invention, shot detection is determined by adaptive thresholding of a histogram difference signal. 1-D color histograms are computed in the RGB space, with N=64 quantization levels for each band. Other color models (LAB or LUV) could be used, and might provide better shot detection performance, but at increased computational cost. The L1 metric is used to represent the color dissimilarity d_(C)(t,t+1) between two consecutive frames:

$d_{C}(t, t+1) = \sum_{k=1}^{3N} \left| h_{t}^{k} - h_{t+1}^{k} \right|$

where h_(t) ^(k) denotes the value of the k-th bin for the concatenated RGB histogram of frame t. The 1-D signal d_(C) is then binarized by a threshold that is computed on a sliding window centered at time t of length fr/2, where fr denotes the frame rate:

$s(t) = \begin{cases} 1, & d_{C}(t) > \mu_{d}(t) + k\,\sigma_{d}(t) \\ 0, & \text{otherwise} \end{cases}$

where μ_(d)(t) denotes the mean of dissimilarities computed on the sliding window, σ_(d)(t) denotes the mean absolute deviation of the dissimilarity within the window, which is known to be a more robust estimator of the variability of a data set around its mean, and k is a factor that sets the confidence interval for determination of the threshold, set in the interval. Consecutive frames are therefore deemed to belong to the same shot if s(t)=0, and a shot boundary between adjacent frames is identified when s(t)=1.
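The thresholding step can be illustrated with a minimal Python sketch. It assumes that each frame has already been reduced to a concatenated RGB histogram (a list of 3N bin counts), and the function names, the `window` length and the value of `k` are illustrative choices made here rather than values taken from the description.

```python
def l1_histogram_distance(h_a, h_b):
    """L1 distance d_C between two concatenated RGB histograms."""
    return sum(abs(a - b) for a, b in zip(h_a, h_b))


def dissimilarity_signal(histograms):
    """d_C(t, t+1) for every pair of consecutive frame histograms."""
    return [l1_histogram_distance(histograms[t], histograms[t + 1])
            for t in range(len(histograms) - 1)]


def detect_boundaries(d_c, window, k):
    """Binarize d_c with a threshold computed on a sliding window.

    window (in samples) and k are assumptions supplied by the caller;
    the description ties the window length to fr/2.
    """
    half = window // 2
    s = []
    for t, d in enumerate(d_c):
        lo, hi = max(0, t - half), min(len(d_c), t + half + 1)
        win = d_c[lo:hi]
        mu = sum(win) / len(win)                        # window mean
        mad = sum(abs(v - mu) for v in win) / len(win)  # mean absolute deviation
        s.append(1 if d > mu + k * mad else 0)
    return s
```

For a signal that is flat except for one spike, for example `detect_boundaries([1, 1, 1, 1, 10, 1, 1, 1, 1], window=5, k=2)`, only the spike is marked as a boundary. The sketch stops at the binary signal s(t) and does not include the hit-or-miss post-processing.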

As a post-processing step, an in-place morphological hit-or-miss transform is applied on the binary signal with a pair of structuring elements that eliminate the presence of multiple adjacent shot boundaries,

$b(t) = s(t) \otimes \left( e_{1}(t), e_{2}(t) \right)$

where ⊗ denotes hit-or-miss, and the size of the structuring elements is based on the home video shot duration histograms (home video shots are unlikely to last less than a few seconds), and it is set to fr/2 (see Jean Serra: Image Analysis and Mathematical Morphology, Vol. 1, Academic Press, 1982).

3. Video Inter-Segment Feature Definition

A feature set for visual dissimilarity, temporal separation and accumulated segment duration is generated in the video shot feature extraction stage 12. Both visual dissimilarity and temporal information, particularly temporal separation, have been used for clustering in the past. In the case of visual dissimilarity, and in terms of discerning power of a visual feature, it is clear that a single frame is often insufficient to represent the content of a segment. From the several available solutions, the mean segment color histogram is selected to represent segment appearance,

$m_{i} = \frac{1}{M_{i}} \sum_{t=b_{i}}^{e_{i}} h_{t}$

where h_(t) denotes the t-th color histogram, and m_(i) denotes the mean histogram of segment s_(i), each consisting of M_(i)=e_(i)−b_(i)+1 frames (b_(i) and e_(i) denote the beginning and ending frame of segment s_(i)). The mean histogram represents both the presence of the dominant colors and their persistence within the segment. The L1 norm of the mean segment histogram difference is used to visually compare a pair of segments i and j,

$\alpha_{ij} = \sum_{k=1}^{B} \left| m_{ik} - m_{jk} \right|$

where α_(ij) denotes visual dissimilarity between segments i and j, B is the number of histogram bins, m_(ik) is the value of the k-th bin of the mean color histogram of segment s_(i), and m_(jk) is the value of the k-th bin of the mean color histogram of segment s_(j).
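As an illustration of the two formulas above, the following Python sketch computes a mean segment histogram and the visual dissimilarity between two segments. It assumes each segment is given as a list of per-frame histograms of equal length; the function names are hypothetical.

```python
def mean_histogram(frame_histograms):
    """Bin-wise mean m_i of the frame histograms h_t of one segment."""
    m = len(frame_histograms)
    return [sum(bin_values) / m for bin_values in zip(*frame_histograms)]


def visual_dissimilarity(m_i, m_j):
    """alpha_ij: L1 norm of the difference between two mean histograms."""
    return sum(abs(a - b) for a, b in zip(m_i, m_j))


# Two toy segments with two-bin histograms.
m_a = mean_histogram([[4, 0], [2, 2]])   # [3.0, 1.0]
m_b = mean_histogram([[0, 4]])           # [0.0, 4.0]
alpha_ab = visual_dissimilarity(m_a, m_b)  # 6.0
```

Identical segments give a dissimilarity of zero, and the value grows as the dominant colors of the two segments diverge.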

In the case of temporal information, the temporal separation between segments s_(i) and s_(j), which is a strong indication of their belonging to the same cluster, is defined as

$\beta_{ij} = \min\left( |e_{i} - b_{j}|, |e_{j} - b_{i}| \right)\left( 1 - \delta_{ij} \right)$

where δ_(ij) denotes a Kronecker's delta, b_(i), e_(i) denote first and last frames of segment s_(i), and b_(j), e_(j) denote first and last frames of segment s_(j).

Additionally, the accumulated segment (combined) duration of two individual segments is also a strong indication about their belonging to the same cluster. FIG. 3 shows the empirical distribution of home video shot duration for approximately 660 shots from a database with ground-truth, and its fitting by a Gaussian mixture model (see next subsection). (In FIG. 3, the empirical distribution, and an estimated Gaussian mixture model consisting of six components, are superimposed. Duration was normalized to the longest duration found in the database (580 sec.).) Even though videos correspond to different scenarios and were filmed by multiple people, a clear temporal pattern is present (see also the Vasconcelos and Lippman article). The accumulated segment duration τ_(ij) is defined as

$\tau_{ij} = \mathrm{card}(s_{i}) + \mathrm{card}(s_{j})$

where card(s) denotes the number of frames in segment s.

4. Modeling of Likelihoods and Priors

The statistical modeling of the inter-segment feature set is generated in the video segment merging stage 14. The three described features become the components of the feature space x, with vectors x=(α,β,τ). To analyze the separability of the two classes, FIG. 4 shows a scatter plot of 4000 labeled inter-segment feature vectors extracted from home video. (Half of the samples correspond to hypothesis H₁ (segment pair belongs together, labeled with light gray), and the other half to H₀ (segment pair does not belong together, labeled with dark gray). The features have been normalized.)
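The assembly of the feature vector x=(α,β,τ) for one pair of segments can be sketched as follows. The sketch assumes that each segment is described by its first and last frame indices (b, e) and that the visual dissimilarity α has already been computed; the function name is hypothetical and the normalization mentioned above is not included.

```python
def inter_segment_features(segments, i, j, alpha):
    """Return (alpha, beta, tau) for segments i and j.

    segments: list of (b, e) tuples with first and last frame indices.
    alpha:    visual dissimilarity of the pair, computed elsewhere.
    """
    b_i, e_i = segments[i]
    b_j, e_j = segments[j]
    delta = 1 if i == j else 0                      # Kronecker's delta
    beta = min(abs(e_i - b_j), abs(e_j - b_i)) * (1 - delta)
    tau = (e_i - b_i + 1) + (e_j - b_j + 1)         # card(s_i) + card(s_j)
    return (alpha, beta, tau)


segments = [(0, 99), (100, 249), (400, 459)]
x_01 = inter_segment_features(segments, 0, 1, alpha=0.3)  # (0.3, 1, 250)
x_02 = inter_segment_features(segments, 0, 2, alpha=1.2)  # (1.2, 301, 160)
```

Adjacent segments yield a small β, while segments far apart in the clip yield a large one, which is the behavior the temporal separation feature is meant to capture.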

The plot indicates that the two classes are in general separated. A projection of this plot clearly illustrates the limits of relying on pure visual similarity. A parametric mixture model is adopted for each of the class-conditional densities of the observed inter-segment features,

$p(x \mid \varepsilon, \Theta) = \sum_{i=1}^{K_{\varepsilon}} \Pr(c = i)\, p(x \mid \varepsilon, \theta_{i})$

where K_(ε) is the number of components in each mixture, Pr(c=i) denotes the prior probability of the i-th component, p(x|ε,θ_(i)) is the i-th pdf parameterized by θ_(i), and Θ={Pr(c),{θ_(i)}} represents the set of all parameters. In this invention, we assume multivariate Gaussian forms for the components of the mixtures in d-dimensions

$p(x \mid \varepsilon, \theta_{i}) = \frac{1}{(2\pi)^{d/2} \left| \Sigma_{i} \right|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_{i})^{T} \Sigma_{i}^{-1} (x - \mu_{i}) \right)$

so that the parameters θ_(i) are the means μ_(i) and covariance matrices Σ_(i) (see Duda et al., Pattern Classification, op. cit.).
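The evaluation of such a mixture density can be sketched in plain Python. This is an illustrative sketch only: it uses a Cholesky factorization (an implementation choice made here, not taken from the description) to obtain the determinant and the quadratic form, and it assumes every covariance matrix is symmetric positive definite. All function names are hypothetical.

```python
import math


def cholesky(a):
    """Lower-triangular L with a = L L^T (a must be symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            s = sum(low[r][k] * low[c][k] for k in range(c))
            if r == c:
                low[r][c] = math.sqrt(a[r][r] - s)
            else:
                low[r][c] = (a[r][c] - s) / low[c][c]
    return low


def gaussian_pdf(x, mu, cov):
    """Multivariate Gaussian density with mean mu and full covariance cov."""
    d = len(x)
    low = cholesky(cov)
    z = []  # solves L z = x - mu, so sum(z^2) is the quadratic form
    for r in range(d):
        s = sum(low[r][k] * z[k] for k in range(r))
        z.append((x[r] - mu[r] - s) / low[r][r])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[r][r]) for r in range(d))
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + quad))


def mixture_pdf(x, weights, means, covs):
    """Weighted sum of Gaussian components; weights play the role of Pr(c=i)."""
    return sum(w * gaussian_pdf(x, m, c)
               for w, m, c in zip(weights, means, covs))
```

For a two-dimensional standard Gaussian evaluated at its mean, `gaussian_pdf([0.0, 0.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` returns 1/(2π).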

The well-known expectation-maximization (EM) algorithm constitutes the standard procedure for Maximum Likelihood (ML) estimation of the set of parameters Θ (see A. P. Dempster, N. M. Laird and D. B. Rubin, “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, 39:1-38, 1977). EM is a known technique for finding ML estimates for a broad range of problems where the observed data is in some sense incomplete. In the case of a Gaussian Mixture, the incomplete data are the unobserved mixture components, whose prior probabilities are the parameters {Pr(c)}. EM is based on increasing the conditional expectation of the log-likelihood of the complete data given the observed data by using an iterative hill-climbing procedure. Additionally, model selection, i.e., the number of components of each mixture, can be automatically estimated using the Minimum Description Length (MDL) principle (see J. Rissanen, “Modeling by Shortest Data Description”, Automatica, 14:465-471, 1978).

The general EM algorithm, valid for any distribution, is based on increasing the conditional expectation of the log-likelihood of the complete data Y given the observed data X={x₁, . . . , x_(N)}:

$Q(\theta \mid \theta^{(p)}) = E\left\{ \log p(Y \mid \theta) \mid x, \theta^{(p)} \right\}$

by using an iterative hill-climbing procedure. In the previous equation, X=h(Y) denotes a known many-to-one function (for example, a subset operator), x represents a sequence or vector of data, and p is a superscript that denotes the iteration number. The EM algorithm iterates the following two steps until convergence to maximize Q(θ):

-   E-step: Find the expected likelihood of the complete data as a function of θ, Q(θ|θ^((p))).
-   M-step: Re-estimate parameters, according to
    $\theta^{(p+1)} = \arg\max_{\theta}\; Q(\theta \mid \theta^{(p)})$

In other words, first estimate values to fill in for the incomplete data in the E-step (using the conditional expectation of the log-likelihood of the complete data given the observed data, instead of the log-likelihood itself). Then, compute the maximum likelihood parameter estimate in the M-step, and repeat until a suitable stopping criterion is reached. EM is an iterative algorithm that converges to a local maximum of the likelihood of the sample set.

For the specific case of multivariate Gaussian models, the complete data is given by Y=(X, I), where I indicates the Gaussian component that has been used in generating each sample of the observed data. Element-wise, y=(x,i), i∈{1, . . . , K_(ε)}. In this case, EM takes a further simplified form:

-   E-step: For all N training samples, and for all mixture components, compute the probability that Gaussian i fits the sample x_(j) given the current estimation Θ^((p)),
    $p(i \mid x_{j}, \varepsilon, \Theta^{(p)}) = \frac{\pi_{i}\, p(x_{j} \mid \varepsilon, \theta_{i}^{(p)})}{\sum_{k=1}^{K_{\varepsilon}} \pi_{k}\, p(x_{j} \mid \varepsilon, \theta_{k}^{(p)})}$
    where π_(i) denotes the current estimate of the component prior Pr(c=i).
-   M-step: Re-estimate parameters,
    $\pi_{i}^{(p+1)} = \frac{1}{N} \sum_{j=1}^{N} p(i \mid x_{j}, \varepsilon, \Theta^{(p)})$
    $\mu_{i}^{(p+1)} = \frac{\sum_{j=1}^{N} x_{j}\, p(i \mid x_{j}, \varepsilon, \Theta^{(p)})}{\sum_{j=1}^{N} p(i \mid x_{j}, \varepsilon, \Theta^{(p)})}$
    $\Sigma_{i}^{(p+1)} = \frac{\sum_{j=1}^{N} p(i \mid x_{j}, \varepsilon, \Theta^{(p)}) \left( x_{j} - \mu_{i}^{(p+1)} \right) \left( x_{j} - \mu_{i}^{(p+1)} \right)^{T}}{\sum_{j=1}^{N} p(i \mid x_{j}, \varepsilon, \Theta^{(p)})}$

The mean vectors and covariance matrices for each of the mixture components must be initialized in the first place. In this implementation, the means are initialized using the traditional K-means algorithm, while the covariance matrices are initialized with the identity matrix. As with other hill-climbing methods, data-driven initialization usually performs better than pure random initialization. Additionally, on successive restarts of the EM iteration, a small amount of noise is added to each mean, to reduce the tendency of the procedure to become trapped in local maxima.
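The E-step and M-step can be made concrete with a small Python sketch restricted to one-dimensional data, as in the shot-duration fit of FIG. 3. Several simplifications are assumptions made here and not part of the description: the initial means are supplied by the caller instead of coming from K-means, the variances start at 1 as the scalar analogue of the identity matrix, a small variance floor avoids degenerate components, and the stopping test uses the absolute value of the relative change in log-likelihood.

```python
import math


def normal_pdf(x, mu, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_gmm_1d(xs, init_means, tol=1e-2, max_iter=200):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    k, n = len(init_means), len(xs)
    pis = [1.0 / k] * k
    mus = list(init_means)
    variances = [1.0] * k
    prev_ll = None
    for _ in range(max_iter):
        # E-step: responsibility of each component i for each sample x_j.
        resp, ll = [], 0.0
        for x in xs:
            joint = [pis[i] * normal_pdf(x, mus[i], variances[i]) for i in range(k)]
            total = sum(joint)
            ll += math.log(total)
            resp.append([q / total for q in joint])
        # Stop when the relative change in log-likelihood is small.
        if prev_ll is not None and abs(ll - prev_ll) <= tol * abs(prev_ll):
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances.
        for i in range(k):
            r_sum = sum(r[i] for r in resp)
            pis[i] = r_sum / n
            mus[i] = sum(r[i] * x for r, x in zip(resp, xs)) / r_sum
            var = sum(r[i] * (x - mus[i]) ** 2 for r, x in zip(resp, xs)) / r_sum
            variances[i] = max(var, 1e-6)  # variance floor (assumption)
    return pis, mus, variances
```

On two well-separated groups of samples, for instance values near 0 and values near 10 with initial means 1 and 9, the sketch recovers weights close to one half and means close to 0 and 10. The multivariate case replaces the scalar variance update with the covariance update given above.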

The convergence criterion is defined by the rate of increase of the log-likelihood of the observed data in successive iterations,
$\log L(\Theta \mid X) = \log \prod_{j=1}^{N} p(x_j \mid \varepsilon, \Theta)$
i.e., the EM iteration is terminated when
$\frac{\log L(\Theta^{(p+1)} \mid X) - \log L(\Theta^{(p)} \mid X)}{\log L(\Theta^{(p)} \mid X)} \leq 10^{-2}$
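The stopping rule can be sketched as a small helper. The function name `has_converged` is hypothetical. As an assumption not stated in the text, the denominator is taken in absolute value, so that the test behaves the same way when the log-likelihood is negative, as is common for densities evaluated on large training sets.

```python
def has_converged(loglik_prev, loglik_new, tol=1e-2):
    """Relative log-likelihood increase test between successive EM iterations.

    Assumption: abs() on the denominator; loglik_prev must be nonzero.
    """
    return (loglik_new - loglik_prev) / abs(loglik_prev) <= tol
```

For example, an increase from -1000 to -995 is a relative change of 0.005 and would terminate the iteration, whereas an increase from -1000 to -900 would not.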

The specific model, i.e., the number of components K_(ε) of each mixture, is automatically estimated using the Minimum Description Length (MDL) principle, by choosing
$K_\varepsilon^{*} = \underset{K_\varepsilon}{\arg\max} \left( \log L(\Theta \mid X) - \frac{n_{K_\varepsilon}}{2} \log N \right)$
where L(•) denotes the likelihood of the training set, and n_(K_(ε)) is the number of parameters needed for the model, which for a Gaussian mixture is equal to
$n_{K_\varepsilon} = (K_\varepsilon - 1) + K_\varepsilon d + K_\varepsilon \frac{d(d+1)}{2}$
When two models fit the sample data in a similar way, the simpler model (smaller K_(ε)) is chosen.
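As an illustrative sketch of this selection rule, the following computes the parameter count and the penalized score for each candidate number of components. The function names and the dictionary of log-likelihoods are hypothetical; the log-likelihood values are assumed to come from mixtures already fitted by EM.

```python
import math

def gmm_num_params(K, d):
    """Free parameters of a K-component, d-dimensional Gaussian mixture."""
    return (K - 1) + K * d + K * d * (d + 1) // 2

def mdl_score(loglik, K, d, N):
    """Log-likelihood penalized by (n_K / 2) * log N."""
    return loglik - 0.5 * gmm_num_params(K, d) * math.log(N)

def select_num_components(logliks, d, N):
    """logliks maps each candidate K to log L(Theta | X) of its fitted mixture."""
    best_K, best_score = None, -math.inf
    # Ascending K with a strict comparison keeps the simpler model on ties
    for K in sorted(logliks):
        score = mdl_score(logliks[K], K, d, N)
        if score > best_score:
            best_K, best_score = K, score
    return best_K
```

With d = 3, a single component has 9 free parameters and two components have 19, so an additional component must raise the log-likelihood by more than 5 log N to be preferred.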

Instead of imposing independence assumptions among the variables, the full joint class-conditional pdfs are estimated. The ML estimation of the parametric models for p(x|ε=0) and p(x|ε=1), by the procedure just described, produces probability densities represented by ten components in each case.

In the Bayesian approach, the prior probability mass function Pr(ε) encodes all the previous knowledge at hand about the specific problem. In this particular case, this represents the knowledge or belief about the merging process characteristics (home video clusters mostly consist of only a few shots). There exist a variety of solutions that can be explored:

-   The simplest assumption is Pr(ε=0)=Pr(ε=1)=½, which turns the MAP criterion into the ML criterion.
-   The priors themselves can be ML-estimated from training data (see Duda et al., Pattern Classification, op. cit.). It is straightforward to show that, assuming that the N training samples are independent, the ML estimator of the priors is
    $\Pr(\varepsilon = e) = \frac{1}{N} \sum_{k=1}^{N} \iota(e, k)$

where ι(e, k) is equal to one if the k-th training sample belongs to the class represented by ε=e, e∈{0,1}, and zero otherwise. In other words, the priors are simply weights determined by the available evidence (the training data).
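The ML estimate of the priors amounts to counting class frequencies in the training set, as the following minimal sketch shows. The function name `estimate_priors` and the representation of the training labels as a list of 0/1 values are assumptions made for illustration.

```python
from collections import Counter

def estimate_priors(labels):
    """Return {e: Pr(eps = e)} for e in {0, 1} from 0/1 training labels."""
    labels = list(labels)
    counts = Counter(labels)          # counts[e] is the sum over k of iota(e, k)
    n = len(labels)
    return {e: counts.get(e, 0) / n for e in (0, 1)}
```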

-   The dynamics involved in the merging algorithm (presented in the following section) also influence the prior knowledge in a sequential manner (it is expected that more segments will be merged at the beginning of the process, and fewer at the end). In other words, the prior can be dynamically updated based on this rationale.

5. Video Segment Clustering

The merging algorithm is implemented in the video segment merging stage 14. Any merging algorithm requires three elements: a feature model, a merging order, and a merging criterion (L. Garrido, P. Salembier, D. Garcia, “Extensive Operators in Partition Lattices for Image Sequence Analysis”, Sign. Proc., 66(2): 157-180, 1998). The merging order determines which clusters should be probed for possible merging at each step of the process. The merging criterion decides whether the merging should occur or not. The feature model of each cluster should be updated if a merging occurs. The present video segment clustering method uses this general formulation, based on the statistical inter-segment models developed in the previous section. In the present algorithm, the class-conditionals are used to define both the merging order and the merging criterion.

Merging algorithms can be efficiently implemented by the use of adjacency graphs and hierarchical queues, which allow for prioritized processing. Elements to be processed are assigned a priority, and introduced into the queue according to it. Then, the element that is extracted at each step is the one that has the highest priority. Hierarchical queues are now traditional tools in mathematical morphology. Their use in Bayesian image analysis first appeared in C. Chou and C. Brown, “The Theory and Practice of Bayesian Image Labeling”, IJCV, 4, pp. 185-210, 1990, with the Highest Confidence First (HCF) optimization method. The concept is intuitively appealing: at each step, decisions should be made based on the piece of information that has the highest certainty. Recently, similar formulations have appeared in morphological processing.

As shown in FIG. 2, the segment merging method comprises two stages: a queue initialization stage 20 and a queue updating/depletion stage 30. The merging algorithm comprises a binary Bayes classifier, where the merging order is determined by a variation of Highest Confidence First (HCF), and the Maximum a Posteriori (MAP) criterion defines the merging criterion.

Queue initialization. At the beginning (22) of the process, inter-shot features x_(ij) are computed for all pairs of adjacent shots in the video. Each feature x_(ij) is introduced (24) in the queue with priority equal to the probability of merging the corresponding pair of shots, Pr(ε=1|x_(ij)).
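The priority Pr(ε=1|x_(ij)) follows from the class-conditional densities and the priors by Bayes' rule. A minimal sketch is given below, with hypothetical function names and with the two density values assumed to have been evaluated beforehand from the fitted mixtures. It also shows that the MAP criterion is equivalent to requiring a posterior greater than one half.

```python
def merge_posterior(lik_merge, lik_keep, prior_merge=0.5):
    """Pr(eps=1 | x) given p(x | eps=1), p(x | eps=0) and Pr(eps=1).

    Assumes the two likelihoods are not both zero.
    """
    num = lik_merge * prior_merge
    den = num + lik_keep * (1.0 - prior_merge)
    return num / den

def map_merge(lik_merge, lik_keep, prior_merge=0.5):
    """MAP decision: merge when p(x|eps=1)Pr(eps=1) > p(x|eps=0)Pr(eps=0)."""
    return lik_merge * prior_merge > lik_keep * (1.0 - prior_merge)
```

Because both quantities share the same normalizing denominator, ordering the queue by the posterior and thresholding it at one half reproduces the MAP decision.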

Queue depletion/updating. The definition of priority allows making decisions always on the pair of segments of highest certainty. Until the queue is empty (32), the procedure is as follows:

1. In the element extraction stage 34, extract an element (pair of segments) from the queue. This element is the one that has the highest priority.

2. Apply the MAP criterion (36) to merge the pair of segments, i.e.,
$p(x_{ij} \mid \varepsilon = 1) \Pr(\varepsilon = 1) > p(x_{ij} \mid \varepsilon = 0) \Pr(\varepsilon = 0)$

3. If the segments are merged (the path 38 indicating the application of hypothesis H₁), update the model of the merged segment in segment model updating stage 40, then update the queue in the queue updating stage 42 based on the new model, and go to step 1. Otherwise, if the segments are not merged (the path 44 indicating the application of hypothesis H₀), go to step 1.

When a pair of segments is merged, the model of the new segment s_(i′) is updated by
$m_{i'} = \frac{\mathrm{card}(s_i)\, m_i + \mathrm{card}(s_j)\, m_j}{\mathrm{card}(s_i) + \mathrm{card}(s_j)}$
$b_{i'} = \min(b_i, b_j)$
$e_{i'} = \max(e_i, e_j)$
$\mathrm{card}(s_{i'}) = \mathrm{card}(s_i) + \mathrm{card}(s_j)$
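The model update can be sketched as follows. The class `SegmentModel` and the function `merge_models` are hypothetical names. The field types are also assumptions: m is taken to be a feature vector stored as a tuple of floats, and b and e are taken to be integer begin and end positions.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SegmentModel:
    mean: Tuple[float, ...]   # m_i (assumed to be a feature vector)
    begin: int                # b_i
    end: int                  # e_i
    card: int                 # card(s_i)

def merge_models(a: SegmentModel, b: SegmentModel) -> SegmentModel:
    """Cardinality-weighted mean, earliest begin, latest end, summed cardinality."""
    total = a.card + b.card
    mean = tuple(
        (a.card * x + b.card * y) / total for x, y in zip(a.mean, b.mean)
    )
    return SegmentModel(mean, min(a.begin, b.begin), max(a.end, b.end), total)
```

Weighting by cardinality makes the merged mean independent of the order in which the constituent segments were merged.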

After having updated the model of the (new) merged segment, four functions need to be implemented to update the queue:

1. Extraction from the queue of all those elements that involved the originally individual (now merged) segments.

2. Computation of new inter-segment features x=(α,β,τ) using the updated model.

3. Computation of new priorities Pr(ε=1|x_(ij)).

4. Insertion in the queue of elements according to new priorities.
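The queue initialization, depletion and updating stages can be sketched together with a binary heap, as shown below. This is a sketch under stated assumptions rather than the described implementation: `hcf_merge`, `posterior` and `merge_fn` are hypothetical names, segments are assumed to be adjacent only to their temporal neighbors, and the explicit extraction of outdated elements (function 1 above) is replaced by lazy deletion, in which outdated elements are skipped when they reach the top of the queue.

```python
import heapq

def hcf_merge(models, posterior, merge_fn):
    """Merge adjacent segments in order of decreasing merge probability.

    models:    per-shot models in temporal order
    posterior: hypothetical callable (model_a, model_b) -> Pr(eps=1 | x_ab)
    merge_fn:  hypothetical callable (model_a, model_b) -> merged model
    Returns (merging sequence of (a, b, new_id), surviving segments by id).
    """
    n = len(models)
    segs = dict(enumerate(models))                      # live segments by id
    left = {i: (i - 1 if i > 0 else None) for i in range(n)}
    right = {i: (i + 1 if i + 1 < n else None) for i in range(n)}
    heap, sequence, next_id = [], [], n

    def push(a, b):
        # heapq is a min-heap, so the priority is negated
        heapq.heappush(heap, (-posterior(segs[a], segs[b]), a, b))

    for i in range(n - 1):                              # queue initialization
        push(i, i + 1)

    while heap:                                         # queue depletion/updating
        neg_p, a, b = heapq.heappop(heap)
        if a not in segs or b not in segs:
            continue                                    # outdated element, skip
        if -neg_p <= 0.5:
            continue                                    # MAP criterion: H0, no merge
        new, next_id = next_id, next_id + 1
        segs[new] = merge_fn(segs.pop(a), segs.pop(b))  # H1: update the model
        sequence.append((a, b, new))
        l, r = left[a], right[b]
        left[new], right[new] = l, r
        if l is not None:                               # new features and priorities
            right[l] = new
            push(l, new)
        if r is not None:
            left[r] = new
            push(new, r)
    return sequence, segs
```

Segment identifiers are never reused, so an element whose two identifiers are both still live necessarily refers to a pair that is still adjacent.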

Note that, unlike many previous methods (such as described in the Rui and Huang article), this formulation does not need any empirical parameter determination.

The merging sequence, i.e., a list with the successive merging of pairs of video segments, is stored and used to generate a hierarchy. Furthermore, for visualization and manipulation, after emptying the hierarchical queue in the merging algorithm, further merging of video segments is allowed to build a complete merging sequence that converges into a single segment (the whole video clip). The merging sequence is then represented by a partition tree 18 (FIG. 1), which is known to be an efficient structure for hierarchical representation of visual content, and provides the starting point for user interaction.

6. Video Hierarchy Visualization.

An example of a tree representation stage 50 appears in FIG. 5. A prototype of an interface to display the tree representation of the analyzed home video may be based on key frames, that is, a frame extracted from each segment. A set of functionalities that allow for manipulation (correction, augmentation, reorganization) of the automatically generated video clusters, along with cluster playback, and other VCR capabilities may be applied to the representation. The user may parse the video using this tree representation, retrieve preview clips and do video editing.

Queue-based methods with real-valued priorities can be very efficiently implemented using binary search trees, where the operations of insertion, deletion and minimum/maximum location are straightforward. In the preferred embodiment of the invention, the implementation is related to the description in L. Garrido, P. Salembier and D. Garcia, “Extensive Operators in Partition Lattices for Image Sequence Analysis”, Signal Processing, (66), 2, 1998, pp. 157-180.

The merging sequence, i.e., a list with the successive merging of pairs of video segments, is stored and used to generate a hierarchy. The first level 52 in the hierarchy is defined by key frames from the individual segments provided by the video segmentation stage 10. The second level 54 in the hierarchy is defined by key frames from the clusters generated by the algorithm used in the segment merging stage 14.

For visualization and manipulation, after emptying the hierarchical queue in the merging algorithm, further merging of video segments is allowed to build a complete merging sequence that converges into a single segment (i.e., the key frame stage 56 represents the whole video clip). The whole video clip therefore constitutes the third level of the hierarchy. The merging sequence is then represented by a Binary Partition Tree (BPT), which is known to be an efficient structure for hierarchical representation of visual content. In a BPT, each node (with exception of the leaves, which correspond to the initial shots) has two children. (P. Salembier, L. Garrido, “Binary Partition Tree as an Efficient Representation for Filtering, Segmentation, and Information Retrieval”, IEEE Intl. Conference on Image Processing, ICIP '98, Chicago, Ill., Oct. 4-7, 1998.) The BPT also provides the starting point to build a tool for user interaction.

The tree representation provides an easy-to-use interface for visualization and manipulation (verification, correction, augmentation, reorganization) of the automatically generated video clusters. Given the generality of home video content and the variety of user preferences, manual feedback mechanisms may improve the generation of video clusters, and additionally give users the possibility of actually doing something with their videos.

In a simple interface for displaying the tree representation 50 of the merging process, an implementing program would read a merging sequence, and build the binary tree, representing each node of the sequence by a frame extracted from each segment. A random frame represents each leaf (shot) of the tree. Each parent node is represented by the child random-frame with the smaller shot number. (Note that the term “random” may be preferred instead of “keyframe” because no effort was made in selecting it.) Note also that the representation shown in FIG. 5 is useful to visualize the merging process, identify erroneous clusters, or for general display when the number of shots is small, but it can become very deep to display when the original number of shots is large.
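Building the binary tree from a stored merging sequence can be sketched as follows. The encoding of each merge as a triple (a, b, new_id), the class `Node` and the function `build_bpt` are assumptions made for illustration; the representative of each parent is the representative of whichever child has the smaller shot number, as described above.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    node_id: int
    rep_shot: int                      # shot whose frame represents this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_bpt(num_shots: int, sequence: List[Tuple[int, int, int]]) -> List[Node]:
    """Return the roots left after replaying the merging sequence.

    A complete merging sequence leaves exactly one root (the whole video clip).
    """
    nodes = {i: Node(i, i) for i in range(num_shots)}   # leaves are the shots
    for a, b, new_id in sequence:
        child_a, child_b = nodes.pop(a), nodes.pop(b)
        rep = min(child_a.rep_shot, child_b.rep_shot)
        nodes[new_id] = Node(new_id, rep, child_a, child_b)
    return list(nodes.values())
```

For four shots and the sequence (0, 1, 4), (2, 3, 5), (4, 5, 6), the single root is node 6, represented by the frame of shot 0.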

A second version of the interface could display only the three levels of the hierarchy, i.e., the leaves of the tree, the clusters that were obtained as the result of the probabilistic merging algorithm, and the complete-video node. This mode of operation should allow for interactive reorganization of the merging sequence, so that the user can freely exchange video segments among clusters, combine clusters from multiple video clips, etc. Integration of either interface with other desired features, like playback of preview sequences when clicking on the tree nodes, and VCR capabilities, should be clear to those skilled in this art.

The invention has been described with reference to a preferred embodiment. However, it will be appreciated that variations and modifications can be effected by a person of ordinary skill in the art without departing from the scope of the invention. Although the preferred embodiment of the invention has been described for use with consumer home videos, it should be understood that the invention can be easily adapted for other applications, including without limitation the summarization and storyboarding of digital movies generally, the organization of video materials from news and product-related interviews, health imaging applications where motion is involved, and the like.

Parts List

-   8 video frames
-   10 video segmentation stage
-   12 video shot feature extraction stage
-   14 video segment merging stage
-   16 video segment tree construction
-   18 binary partition tree
-   20 queue initialization stage
-   22 beginning of process
-   30 queue depletion/updating stage
-   32 queue empty decision
-   34 element extraction stage
-   36 MAP criterion application
-   38 path for hypothesis H₁
-   40 segment model updating stage
-   42 queue updating stage
-   44 path for hypothesis H₀
-   50 tree representation
-   52 first level
-   54 second level
-   56 whole video clip

1. A method for representing contents of a video sequence, the video sequence including individual shots, and each shot including individual frames, the method comprising the steps of: generating a tree representation of the video sequence; and storing the tree representation in a computer readable memory, wherein the tree representation including levels of nodes, wherein the levels include a top level including only a single complete-video node, a bottom level including a plurality of leaf nodes, and an intermediate level including a plurality of nodes fewer in number than the plurality of leaf nodes, wherein each node corresponds to one or more of the shots and each node is represented in the tree representation by a frame of its corresponding shot(s), wherein each leaf node corresponds to one of the shots, wherein the complete-video node corresponds to all of the shots in the video sequence, and wherein each node in the intermediate level corresponds to a cluster of shots, but not all of the shots in the video sequence.
2. The method of claim 1, further comprising the step of displaying the tree representation on a display.
3. The method of claim 1, further comprising the step of changing the cluster of shots associated with a node in the intermediate level based at least upon user feedback.
4. The method of claim 2, further comprising the steps of: changing the cluster of shots associated with a node in the intermediate level based at least upon user feedback; and displaying the tree representation on the display based at least upon results of the changing step.
5. The method of claim 2, further comprising the steps of: receiving a selection of a node; and initiating playback of the shot(s) that correspond(s) to the selected node.
6. The method of claim 5, wherein the initiating playback step includes providing playback control functionality to a user.
7. The method of claim 6, wherein the playback control functionality includes pause, stop, fast-forward, and rewind.
8. The method of claim 1, wherein each frame representing a node in the tree representation is a key frame.
9. The method of claim 2, further comprising the step of editing the video sequence based at least upon user interaction with the tree representation.
10. The method of claim 2, further comprising the steps of: receiving a selection of a node; retrieving the shot(s) corresponding to the selected node; and storing the retrieved shot(s) in a computer-readable memory separate from the video sequence.
11. The method of claim 10, further comprising the step of deleting the retrieved shot(s) from the video sequence.
12. A processor-accessible memory system storing instructions configured to cause a data processing system to implement a method for representing contents of a video sequence, the video sequence including individual shots, and each shot including individual frames, wherein the instructions comprise: instructions for generating a tree representation of the video sequence; and instructions for storing the tree representation in a computer readable memory, wherein the tree representation including levels of nodes, wherein the levels include a top level including only a single complete-video node, a bottom level including a plurality of leaf nodes, and an intermediate level including a plurality of nodes fewer in number than the plurality of leaf nodes, wherein each node corresponds to one or more of the shots and each node is represented in the tree representation by a frame of its corresponding shot(s), wherein each leaf node corresponds to one of the shots, wherein the complete-video node corresponds to all of the shots in the video sequence, and wherein each node in the intermediate level corresponds to a cluster of shots, but not all of the shots in the video sequence.
13. The processor-accessible memory system of claim 12, wherein the instructions further comprise instructions for displaying the tree representation on a display.
14. The processor-accessible memory system of claim 12, wherein the instructions further comprise instructions for changing the cluster of shots associated with a node in the intermediate level based at least upon user feedback.
15. The processor-accessible memory system of claim 13, wherein the instructions further comprise: instructions for receiving a selection of a node; and instructions for initiating playback of the shot(s) that correspond(s) to the selected node.
16. The processor-accessible memory system of claim 13, wherein the instructions further comprise instructions for editing the video sequence based at least upon user interaction with the tree representation.
17. The processor-accessible memory system of claim 13, wherein the instructions further comprise: instructions for receiving a selection of a node; instructions for retrieving the shot(s) corresponding to the selected node; and instructions for storing the retrieved shot(s) in a computer-readable memory separate from the video sequence.
18. A system comprising: a data processing system; and a memory system communicatively connected to the data processing system and storing instructions configured to cause the data processing system to implement a method for representing contents of a video sequence, the video sequence including individual shots, and each shot including individual frames, wherein the instructions comprise: instructions for generating a tree representation of the video sequence; and instructions for storing the tree representation in a computer readable memory, wherein the tree representation including levels of nodes, wherein the levels include a top level including only a single complete-video node, a bottom level including a plurality of leaf nodes, and an intermediate level including a plurality of nodes fewer in number than the plurality of leaf nodes, wherein each node corresponds to one or more of the shots and each node is represented in the tree representation by a frame of its corresponding shot(s), wherein each leaf node corresponds to one of the shots, wherein the complete-video node corresponds to all of the shots in the video sequence, and wherein each node in the intermediate level corresponds to a cluster of shots, but not all of the shots in the video sequence.
19. The system of claim 18, wherein the instructions further comprise: instructions for displaying the tree representation on a display; and instructions for editing the video sequence based at least upon user interaction with the tree representation.
20. The system of claim 13, wherein the instructions further comprise: instructions for displaying the tree representation on a display; instructions for receiving a selection of a node; instructions for retrieving the shot(s) corresponding to the selected node; and instructions for storing the retrieved shot(s) in a computer-readable memory separate from the video sequence.